Abstract

The complexity of the visual world creates significant challenges for comprehensive visual understanding. In spite of recent successes in visual recognition, today’s vision systems still struggle with visual queries that require deeper reasoning. We propose a knowledge base (KB) framework to handle an assortment of visual queries without the need to train new classifiers for new tasks. Building such a large-scale multimodal KB presents a major scalability challenge. We cast a large-scale MRF into a KB representation, incorporating visual, textual and structured data, as well as their diverse relations. We introduce a scalable knowledge base construction system that is capable of building a KB with half a billion variables and millions of parameters in a few hours. Our system achieves competitive results compared to purpose-built models on standard recognition and retrieval tasks, while exhibiting greater flexibility in answering richer visual queries.


1. Introduction 

Type the following query into a search engine such as Google: “names of universities in Manhattan”. The returned list of answers is often sensible. But try this one: “names of universities with computer science PhD program in Manhattan”. The answers are far from satisfying. Both questions are perfectly clear to most humans, but current NLP-based algorithms still fail to perform well on more complex queries. In vision, we see a similar pattern. Much progress has been made in tasks such as classification and detection of single objects (e.g., Fig. 1(a)). But real-world vision applications may involve more diverse and heterogeneous querying needs (e.g., Fig. 1(b)). Traditional classification-based methods would struggle in such tasks.

Towards the goal of scaling up diverse and heterogeneous visual querying tasks, a handful of recent papers [7, 59] have suggested casting visual recognition tasks into a framework that enables more heterogeneous reasoning and inference. A major benefit of doing so is to avoid training a new set of classifiers every time a new type of question arises. We approach this problem by building a large-scale multimodal knowledge base (KB), where we answer visual queries (like the ones in Fig. 1(b)) by evaluating probabilistic KB queries.

[Figure 1, example queries. (a) “Find me pictures of a dog.” Q: “Find photos of me sea kayaking last Halloween in my photo album.”]

Figure 1: A classification-based method might be sufficient to find images of a dog in query (a), but it would struggle with queries in real-world applications. To answer the queries in (b), we need to fuse visual information with metadata for joint reasoning. We propose a visual knowledge base framework to perform different types of visual tasks without training new classifiers. Our framework allows one to express such a complex task with a single query.

[Figure 1(b), example query. Q: “Where can I find similar cuisines in downtown Chicago?” Answers: U. U. Grill, Chicago, IL 60642; S. C. Steak House, Chicago, IL 60657.]

A KB can often be viewed as a large-scale graph structure that connects different entities through their relations [38, 5! ]. In NLP, some early promising results have been shown by encoding entity and relation information in text-based KBs, e.g., Freebase [3] and IBM Watson’s Jeopardy system [13]. In vision, there is now a small but growing amount of attention to building visual KBs. In NEIL [7], Chen et al. have shown the benefit of using contextual relations between scenes, objects and attributes to improve scene classification and object detection. However, its testing scenario is limited to recognition-based tasks, and it lacks a coherent inference model to extend to richer high-level tasks without training new classifiers. Zhu et al. [59] have shown how to build a Markov Logic KB for affordance reasoning. However, their testing scenario is limited by its small data size and discrete representation. Our paper is particularly inspired by these two works [7, 59], but focuses on addressing the following two key challenges.

First, answering a variety of heterogeneous visual queries without re-training. In real-world vision applications, the space of possible queries is huge (even infinite). It is impossible to retrain classifiers for every type of query. Our system demonstrates its ability to perform reasoning and inference on an assortment of visual querying tasks, ranging from scene classification and image search to real-world application queries, without the need to train new classifiers for new tasks. We formalize answering these visual queries as computing marginal probabilities of the joint probability model (Sec. 5). The key technique is to express visual queries in a logical form that can be answered from the visual KB by a principled inference method. We qualitatively evaluate our KB model in answering application queries like the ones in Fig. 1 (Sec. 6.1). We then perform quantitative evaluations on recognition tasks (Sec. 6.2) and retrieval tasks (Sec. 6.3) using the SUN dataset [50]. Our system achieves competitive results compared to the classification-based baseline models, while exhibiting greater flexibility in answering a variety of visual queries.

Second, learning with large-scale multimodal data. To build such a scalable KB, the model needs to perform joint learning and inference on a large amount of images, text and structured data, using both discrete and continuous variables. Existing text-based KB representations [15, 38, 5 ] fail to incorporate continuous visual features in a probabilistic framework, which prevents them from expressing richer multimodal data. In vision, MRFs have been widely used as a probabilistic framework to model joint distributions among multimodal variables [9, 26, 27, 46]. We cast an MRF model into a KB representation to accommodate a mixture of discrete and continuous variables in a joint probability model. Applying MRFs to a large-scale KB framework, however, means that we need to overcome the challenge of scalable learning and inference. We build a scalable visual KB construction system by leveraging database techniques, high-speed sampling [55] and first-order methods [35]. We are able to build a KB with half a billion variables and four million parameters, which is four orders of magnitude larger than that of Zhu et al. [59] while using half of its training time.


2. Previous Work 

Joint Models in Vision A series of context models have leveraged MRFs in various vision tasks, such as image segmentation [16, 27, 33], object recognition [9, 26], object detection [46], pose and activity recognition [5^ ] and other recognition tasks [20, 36]. Similarly, the family of And-Or graph models [47, 5i ] focuses on parsing images and videos into a hierarchical structure. In this work, we use an MRF representation for joint learning and inference over our data, casting MRF models into modern KB systems. In particular, we address the scalability challenge of large-scale MRF learning with our knowledge base construction system.

Learning with Vision and Language Previous work on joint learning with vision and language abounds [23, 30, 41, 42, 60]. Image and video captioning has recently become a popular task, where the goal is to generate a short text description for images and videos [8, 11, 21, 29, 44, 48, 51]. It has been followed by visual question answering [1, 14, 31, 32, 53], which aims at answering natural language questions based on image content. Both captioning and question answering operate on a single image and produce natural language outputs. Our system offers a single, coherent framework that can perform joint learning and inference on one or multiple images as well as metadata in textual and other forms.

Knowledge Bases Most KB work in the database and NLP communities focuses on organizing and retrieving only textual information in a structured representation [3, 13, 28, 5 ]. Although a few large-scale KBs [3, 1 ] have made attempts to incorporate visual information, they simply cache the visual contents and link them to text via hyperlinks. In vision, a series of work has focused on extracting relational knowledge from visual data [5, 39, 60]. Chen et al. [7], Divvala et al. [10] and Zhu et al. [59] have recently proposed KB-based frameworks for visual recognition tasks. However, they all lack an inference framework to deal with more diverse types of visual queries. PhotoRecall [21 ] proposed a pre-defined knowledge structure to retrieve photos from text queries. In contrast, our system allows for new KB structures and offers the flexibility of answering richer types of queries.

3. A Joint Probability Model: Casting a Large-Scale MRF into a KB System

Our first task is to build a system that can efficiently learn a KB from a large amount of multimodal information, such as images, metadata, textual labels, and structured labels. For a real-world, large-scale system like this, the challenges are two-fold. First, our learning system must allow for a coherent probabilistic representation of both discrete and continuous variables to accommodate the heterogeneity of the data. Second, we need to develop an efficient but principled learning and inference method that is capable of large-scale computation. We address the first challenge in this section, and the second in Sec. 4.

3.1. The Knowledge Base System 

A KB can be intuitively thought of as a graph of nodes connected by edges as in Fig. 2, where the nodes are called “entities” and the edges are called “relations”. In vision, MRFs have been widely used to represent such graph structures [20, 33, 36, 4(]. Thus, we cast an MRF model as the KB representation, where entities are represented by variables and relations by edges between variables. This model provides an umbrella framework for answering visual queries, where we formalize query answering as evaluating marginals of the joint distribution (Sec. 5). In comparison to the MLNs used in previous work [38, 59], this representation is more generic, allowing us to accommodate continuous random variables and real-valued factors. In practice, we use factor graphs [24, 49], an equivalent bipartite-graph representation of an MRF. Factor graphs provide a simple graphical interpretation of the MRF model, which eases implementation for large-scale inference.
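To make the factor-graph view concrete, the sketch below enumerates the possible worlds (joint assignments) of a two-variable toy graph and normalizes the exponentiated weighted factor sums into probabilities, which is the log-linear form this section formalizes. The variable names, factors and weights are made up for illustration, and brute-force enumeration is only feasible at toy scale; it is not how the system performs inference.

```python
import itertools
import math

# Hypothetical toy KB: two binary label variables for a single image.
VARIABLES = ["travel", "sunny"]

# Each factor is (weight, function of a possible world). Weights are made up.
FACTORS = [
    (1.5, lambda world: float(world["travel"] and world["sunny"])),  # co-occurrence
    (-0.5, lambda world: float(world["travel"])),                     # unary bias
]


def unnormalized(world):
    """exp of the weighted sum of factor values for one possible world."""
    return math.exp(sum(weight * factor(world) for weight, factor in FACTORS))


def all_worlds():
    """Yield every joint assignment of the binary variables."""
    for values in itertools.product([False, True], repeat=len(VARIABLES)):
        yield dict(zip(VARIABLES, values))


def probability(world):
    """Normalize one world's score by the sum over all possible worlds."""
    total = sum(unnormalized(other) for other in all_worlds())
    return unnormalized(world) / total


def marginal(name):
    """Probability that one variable is true, summed over possible worlds."""
    return sum(probability(world) for world in all_worlds() if world[name])


p_both = probability({"travel": True, "sunny": True})
p_travel = marginal("travel")
```

The marginal computed by `marginal` is the same kind of quantity the query-answering step asks for, only obtained here by exhaustive summation rather than sampling.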

A factor graph has two types of nodes: variables and factors. A possible world is a particular assignment to every variable, denoted by $I$. We define the probability of a possible world $I$ to be proportional to a log-linear combination of factors. We assign different weights to factors, expressing their relative influence on the probability. Formally, we define the partition function $Z$ of a possible world $I$ as

$$Z[I] = \exp\left( \sum_{i=1}^{m} w_i f_i(I) \right) \tag{1}$$

where $w_i$ is the weight of the $i$-th factor, $f_i(I)$ is the value of the $i$-th factor in possible world $I$, and $m$ is the total number of factors. The probability of a possible world is

$$\Pr[I; w] = Z[I] \left( \sum_{J \in \mathcal{I}} Z[J] \right)^{-1} \tag{2}$$

where $\mathcal{I}$ is the set of all possible worlds, and $w$ corresponds to the factor weights. In Fig. 2, each node corresponds to a variable, and each edge between nodes corresponds to a factor. We define all the factors used in our KB in Sec. 3.2.

Having defined the structure of the factor graph KB, our learning objective is to find the optimal weights

$$w^{*} = \arg\min_{w} \; -\sum_{I \in \mathcal{I}_E} \log \Pr[I; w] + \lambda \lVert w \rVert_2^2 \tag{3}$$

where $\mathcal{I}_E$ is the set of possible worlds obtained from the training images and $\lambda$ is the regularization parameter. To optimize Eq. (3), we need to compute the stochastic gradient $\nabla w$. It is usually intractable to compute the analytical gradients, as this involves computing an expectation over all possible worlds. We use the contrastive divergence scheme [19] to estimate the log-likelihood gradients. The gradient of the weight of the $i$-th factor (omitting regularization) is approximated by:

$$\nabla w_i \approx f_i(I') - f_i(I'') \tag{4}$$

where $I'$ is a possible world sampled from the training data, and $I''$ is a possible world sampled under the distribution formed by the model (parameterized by $w$). Gibbs sampling [6, 17] is used as the transition operator of the Markov chain. Intuitively, the first term in Eq. (4) increases the probability of the training data, and the second term decreases the probability of samples generated by the model. In-depth studies on the estimated gradients of Eq. (4) can be found in the context of RBM training [4, 45]. We show in Sec. 4 that our system automatically creates a factor graph and learns the weights in a principled and scalable manner.

Figure 2: A graphical illustration of a visual knowledge base (KB). A visual KB contains both visual entities (e.g., scene images) and textual entities (e.g., semantic labels) interconnected by various types of edges characterizing their relations. The nodes and edges correspond to the variables and factors respectively in the factor graph. The colors indicate different node (edge) types.

3.2. Data Sources for the Knowledge Base

We now describe the entities and relations in our KB, and the data sources that we use to populate the KB. For our purposes, SUN [50] is a particularly useful dataset because of a) its diverse set of images, and b) the availability of a large number of category and attribute labels.

Entities can be thought of as descriptors of the images. In the factor graph depicted in Fig. 2, they are the nodes (variables) of the graph.

Images - are represented by their 4096-dimensional activations from the last fully-connected layer of a convolutional network [54]. In total, there are 59,709 images from the SUN dataset [50], of which half are used for building the KB and half for evaluation.

Scene category labels - indicate scene classes. In our experiments, we use 15 basic-level categories (e.g., workplace and transportation) and 298 fine-grained categories (e.g., grotto and swamp) from SUN [50].

Attribute labels - characterize visual properties (e.g., material, layouts, lighting, etc.) of a scene. We use the SUN Attribute Dataset [37], which provides 102 attribute labels (e.g., glossy and warm).

Affordance labels - describe the functional properties of a scene, i.e., the actions that one can perform in a scene. We use a lexicon of 227 affordances (actions).¹ We conducted a large-scale online experiment to annotate the possibility of each of the 227 actions for each scene category. We provide the list of affordances in Sec. C in the supplementary material.

Relations link entities (variables) to each other, as depicted by the squares on the edges in Fig. 2. The weights learned for the edges (factors) indicate the strength of the relations. We introduce three types of relations in our model.

Image - label - maps image features to semantic labels.

Intra-correlations - capture the co-occurrence between attribute-attribute and affordance-affordance pairs.

Inter-correlations - characterize correlations between two different types of labels (category - affordance, affordance - attribute, category - attribute, and relations between categories from different levels).

The entities and relations in the KB are mapped to variables and factors in the factor graph. We represent the image entities as continuous variables, and the label entities as discrete variables. Each image is associated with hundreds of attribute and affordance labels. Together, this amounts to a KB of millions of entities. Table 1 summarizes some of the basic statistics of the KB that will be learned. This is two orders of magnitude larger than previous work [5 ( ] in the number of entities and relations. The large size of our dataset presents a significant scalability challenge. In theory, an MRF can be arbitrarily large. In practice, however, its scalability is limited by the cost of learning and inference. In addition, it is prohibitive to handcraft such a large-scale model from scratch. We therefore need a principled and scalable system for constructing the visual KB.


Table 1: KB Dataset Statistics

                     Attributes     Affordances
Lexicon size         102            227
# Total labels       1.34 × 10^6    1.36 × 10^7
# Positive labels    9.6 × 10^5     1.23 × 10®
# Positive / image   6.7            13.7


¹ From the American Time Use Survey (ATUS) [40], sponsored by the Bureau of Labor Statistics, which catalogs the actions of daily life and represents United States census data.


4. Learning the Large-scale KB System 

Given our goal of learning a real-world, large-scale MRF-based KB system, the biggest challenge we need to address here is efficient learning and inference. A number of recent advances in the database community shed light on how to build a large-scale KB [12, 13, 34]. Our framework closely follows that of Niu et al. [34]. In addition, we address the challenge of learning with multimodal data. Our KB system and the data will be made available to the public.

4.1. Scalable Construction 

There are three key steps to make the knowledge base construction (KBC) scalable: data pre-processing, factor graph generation and high-performance learning. Fig. 3 offers an overview of the KBC process illustrating these three steps, which are indicated by the boxes.

Data Pre-processing The first step (the first box in Fig. 3) is to pre-process raw data into a structured representation, specifically, tables in a relational database. Each database table stores the entities of the same type (e.g., the Affordance table in Fig. 3(a)). This gives us access to database techniques such as SQL queries and parallel computing, which are important for achieving high scalability. We provide the database schema in Sec. A in the supplementary material.
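As a rough illustration of the pre-processing step, the sketch below stores the example rows shown in Fig. 3(a) in an in-memory SQLite database and retrieves co-occurring true labels with a SQL join. The table and column names mirror the figure, but the exact schema (given in the supplementary material) and the choice of SQLite are assumptions made only for this example.

```python
import sqlite3

# In-memory database standing in for the relational store (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE image (sample TEXT, feat_dim INTEGER, value REAL);
    CREATE TABLE affordance (sample TEXT, affordance TEXT, label INTEGER);
    CREATE TABLE attribute (sample TEXT, attribute TEXT, label INTEGER);
    """
)

# Example rows taken from Fig. 3(a); booleans are stored as 0/1.
conn.executemany(
    "INSERT INTO image VALUES (?, ?, ?)",
    [("I1", 1, 0.729), ("I1", 2, 0.341)],
)
conn.executemany(
    "INSERT INTO affordance VALUES (?, ?, ?)",
    [("I1", "travel", 1), ("I1", "cooking", 0)],
)
conn.executemany(
    "INSERT INTO attribute VALUES (?, ?, ?)",
    [("I1", "natural", 1), ("I1", "sunny", 0)],
)

# Pairs of (affordance, attribute) that are both true on the same image.
rows = conn.execute(
    """
    SELECT a.sample, a.affordance, t.attribute
    FROM affordance AS a
    JOIN attribute AS t ON a.sample = t.sample
    WHERE a.label = 1 AND t.label = 1
    ORDER BY a.sample, a.affordance, t.attribute
    """
).fetchall()
```

Keeping each entity type in its own table is what lets a declarative join like this enumerate candidate variable pairs without custom code.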

[Figure 3 panels: data pre-processing; factor graph generation; high-performance learning. The example image I1 carries scene categories (canyon, outdoor), affordances (eating & drinking, travel, watching rock climbing) and attributes (warm, natural, clouds, rock). (a) Relational database: Image table (sample, feat dim, value) with rows (I1, 1, 0.729) and (I1, 2, 0.341); Affordance table (sample, affordance, label) with rows (I1, travel, true) and (I1, cooking, false); Attribute table (sample, attribute, label) with rows (I1, natural, true) and (I1, sunny, false). (b) Human-readable rule hasAffordance(I1, travel) ∧ hasAttribute(I1, sunny), grounded into variables and factors.]

Figure 3: An overview of the knowledge base construction pipeline. We first process the images and text, converting them into a structured representation. We write human-readable rules to define the KB structure. The system automatically creates a factor graph by parsing the rules. We then adopt a scalable Gibbs sampler to learn the weights in the factor graph.

Factor Graph Generation We represent the MRF model by a factor graph for ease of implementing scalable learning. The factor graph is generated from the database tables (the second box in Fig. 3). Each row in the database tables corresponds to a variable in the factor graph. For each training image, we construct a factor graph, where the variables (blue circles in Fig. 3(b)) are linked to their values in the database (dashed lines between Fig. 3(a) and (b)). We then define the factors on these variables. It is prohibitive to handcraft a large KB structure. Instead, we develop a declarative language that allows us to define the factors with a handful of human-readable rules. This language is a simple but powerful extension to previous work like MLNs [38] and PRMs [15], which enables us to specify relations between multimodal entities as logical conjunctions. We show an example rule in Fig. 3(b). This rule describes the co-occurrence between the affordance label travel and the attribute label sunny on image I1. It evaluates to 1 if both labels are true and 0 otherwise. The KBC system parses this rule and creates a factor $f_k$ on these two variables. A weight $w_k$ is assigned to this factor and will be learned in the next step. The system creates a small factor graph for each of the training images. There is no edge between these graphs; however, the same factors in the graphs share the same weight (illustrated by the red squares in Fig. 3(c)). The weight sharing scheme is also specified in the declarative language. We provide a detailed explanation of the declarative language and a complete list of rules in Sec. A in the supplementary material.
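The grounding step described above can be sketched as follows. This is a hypothetical illustration, not the system's declarative language: the `Factor` class, the `ground_rule` helper and the weight-id naming are assumptions made for the example. It instantiates the conjunction rule once per image and ties every resulting factor to one shared weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Factor:
    weight_id: str    # factors with the same id share one learned weight
    variables: tuple  # names of the variables this factor connects

    def value(self, world):
        # Logical conjunction: 1 if every connected variable is true, else 0.
        return 1.0 if all(world[name] for name in self.variables) else 0.0


def ground_rule(image_id, affordance, attribute):
    """Instantiate hasAffordance(img, a) AND hasAttribute(img, b) for one image."""
    variables = (
        f"hasAffordance({image_id},{affordance})",
        f"hasAttribute({image_id},{attribute})",
    )
    return Factor(weight_id=f"cooc:{affordance}:{attribute}", variables=variables)


# One small factor graph per image; no edges connect the graphs.
images = ["I1", "I2"]
factors = [ground_rule(image, "travel", "sunny") for image in images]

# A single weight entry is shared by both grounded factors.
weights = {factor.weight_id: 0.0 for factor in factors}

# A hand-written possible world used to evaluate the grounded factors.
world = {
    "hasAffordance(I1,travel)": True,
    "hasAttribute(I1,sunny)": False,
    "hasAffordance(I2,travel)": True,
    "hasAttribute(I2,sunny)": True,
}
values = [factor.value(world) for factor in factors]
```

Sharing the weight across images is what keeps the parameter count far below the variable count as the number of training images grows.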

High-Performance Learning Having defined the factor graph structures, our goal is to learn the factor weights efficiently. We use the learning method in Sec. 3.1 to find the optimal factor weights. We built a Gibbs sampler for high-performance learning and inference that is able to handle multimodal variables. Our system performs scalable Gibbs sampling through careful system design and speedup techniques. On the system side, we implemented the Hogwild! model [35, 55], which runs asynchronous stochastic gradient descent while still guaranteeing convergence. The system runs in parallel, allowing the sampler to achieve high efficiency. On average, our Gibbs sampler processes 8.2 × 10^7 variables per second. Finally, this step produces a learned visual KB.
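A minimal sketch of the learning loop referenced above, assuming binary variables and a single-threaded sampler. It runs one Gibbs sweep starting from a training world and moves each weight along the estimate in Eq. (4). The function names, learning rate and toy factor are illustrative assumptions; regularization and the asynchronous Hogwild!-style parallelism are omitted.

```python
import math
import random


def score(world, factors, weights):
    """Weighted sum of factor values (the exponent of the log-linear model)."""
    return sum(weights[key] * factor(world) for key, factor in factors.items())


def gibbs_sweep(world, factors, weights, rng):
    """Resample every binary variable once, conditioned on all the others."""
    world = dict(world)
    for name in world:
        world[name] = True
        score_true = score(world, factors, weights)
        world[name] = False
        score_false = score(world, factors, weights)
        # Conditional probability of True is a sigmoid of the score difference.
        p_true = 1.0 / (1.0 + math.exp(score_false - score_true))
        world[name] = rng.random() < p_true
    return world


def cd_step(data_world, factors, weights, rng, learning_rate=0.1):
    """One contrastive-divergence update: f_i(data world) - f_i(model world)."""
    model_world = gibbs_sweep(data_world, factors, weights, rng)
    for key, factor in factors.items():
        weights[key] += learning_rate * (factor(data_world) - factor(model_world))
    return weights


# Toy problem: a single co-occurrence factor over two labels (made up).
rng = random.Random(0)
factors = {"cooc": lambda world: float(world["travel"] and world["sunny"])}
weights = {"cooc": 0.0}
training_world = {"travel": True, "sunny": True}

for _ in range(200):
    cd_step(training_world, factors, weights, rng)
```

Because the training world always satisfies the factor, each update is non-negative here, so the learned weight drifts upward, which matches the intuition that Eq. (4) raises the probability of the training data.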

4.2. Learning Efficiency 

The three steps (described in Sec. 4.1) together contribute to the high scalability of our KBC system. Table 2 shows that with this framework, we can build a KB four orders of magnitude larger in the number of variables and three orders of magnitude larger in the number of model parameters compared to [59] (using Alchemy MLNs [38]), in half of the time. Fig. 4 further demonstrates that the learning time grows steadily as the KB size increases. The end-to-end construction finishes in 5.2 hours on the whole dataset (Sec. 3.2), indicating the potential to build larger-scale KBs in the future.


Table 2: Statistics of the Visual KB Systems

                   variables      parameters     runtime
Zhu et al. [59]    3.15 × 10^4    5.06 × 10^3    10 hr
Ours               5.76 × 10^8    4.19 × 10^6    5.2 hr
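The scale comparison drawn from Table 2 can be checked with a few lines of arithmetic. The snippet uses only the numbers reported in the table; the dictionary layout and variable names are illustrative.

```python
import math

# Values copied from Table 2.
zhu = {"variables": 3.15e4, "parameters": 5.06e3, "hours": 10.0}
ours = {"variables": 5.76e8, "parameters": 4.19e6, "hours": 5.2}

# Orders of magnitude are base-10 logarithms of the ratios.
orders_variables = math.log10(ours["variables"] / zhu["variables"])
orders_parameters = math.log10(ours["parameters"] / zhu["parameters"])
time_ratio = ours["hours"] / zhu["hours"]
```

The ratios come out at a little over four orders of magnitude for variables, a little under three for parameters, and roughly half the runtime, in line with the rounded figures quoted in the text.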



[Figure 4 plot. Legend: our system; Zhu et al. [59]. x-axis: factor graph size (#nodes), from 10^4 to 10^9.]

Figure 4: Efficiency of the knowledge base construction system. The curve is plotted in log-log scale, where the x-axis is the number of nodes in the factor graph, and the y-axis is the runtime to construct the KB.

5. Visual Query Setup 

As we have mentioned in the introduction, one advantage of using a KB system is its ability to handle rich and diverse types of visual queries without training new classifiers. Moreover, this inference is done in one joint model without step-wise filtering, treating images and other metadata on an equal footing in learning and inference. From a user’s perspective, the input to this system is a natural language question along with a set of one or more images. Similarly, the output is a mixture of images and text.

In practice, the space of possible queries is huge. It would be prohibitive to map each natural language question to the corresponding inference task in an ad-hoc manner. One solution is to reformulate the questions in a formal language [2], such as a probabilistic query language based on conjunctive queries [43]. This language allows us to express KB queries and to compute a ranked list of answers based on their marginal probabilities. We briefly describe how this works with an example query that retrieves images of a sunny beach. This query is formed by a conjunction of two predicates (Boolean-valued functions), sceneCategory and hasAttribute:

sceneCategory(i, beach) ∧ hasAttribute(i, sunny)

Given such a query, our task is to find all possible images i where both predicates are true, i.e., image i comes from the scene category beach and has the attribute sunny. Following this example, more complex queries can be formed by joining several predicates together.²

Let $Q$ be a conjunctive query such as the one above. We compute a ranked list of answers (e.g., images of sunny beaches) based on their marginal probabilities. Formally, the marginal probability of a tuple $t$ (a list of variable assignments) being an answer to $Q$ is defined as:

$$\Pr[t \in Q] = \sum_{I \in \mathcal{I}} \mathbb{1}_{t \in Q(I)} \cdot \Pr[I; w] \tag{5}$$

where $\mathcal{I}$ and $\Pr[I; w]$ are defined in Eq. (2), $\mathbb{1}$ is the indicator function, and $Q(I)$ is the set of variable assignments in the possible world $I$ under which $Q$ evaluates to true. We use the same Gibbs sampler as in Sec. 4.1 to estimate tuple marginals by sampling a collection of possible worlds and averaging the query values over these possible worlds. Each query evaluation produces a set of tuple-probability pairs $\{(t_1, p_1), (t_2, p_2), \ldots\}$, from which we retrieve the top answers by sorting the pairs by probability in descending order.
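The sampling-based estimate of Eq. (5) can be sketched as below. The possible worlds are written out by hand as a stand-in for Gibbs samples, and the world layout, the `rank_answers` helper and the `sunny_beach` query function are hypothetical names introduced for this example.

```python
from collections import Counter


def rank_answers(sampled_worlds, query):
    """Estimate Pr[t in Q] as the fraction of sampled worlds in which t answers Q."""
    counts = Counter()
    for world in sampled_worlds:
        counts.update(query(world))  # query returns the answer tuples in this world
    n = len(sampled_worlds)
    pairs = [(answer, count / n) for answer, count in counts.items()]
    # Sort by descending probability, breaking ties by answer name.
    return sorted(pairs, key=lambda pair: (-pair[1], pair[0]))


def sunny_beach(world):
    """sceneCategory(i, beach) AND hasAttribute(i, sunny), evaluated in one world."""
    return {
        image
        for image, labels in world.items()
        if labels["category"] == "beach" and labels["sunny"]
    }


# Four hand-written possible worlds standing in for samples from the model.
worlds = [
    {"img1": {"category": "beach", "sunny": True},
     "img2": {"category": "beach", "sunny": False}},
    {"img1": {"category": "beach", "sunny": True},
     "img2": {"category": "beach", "sunny": True}},
    {"img1": {"category": "beach", "sunny": True},
     "img2": {"category": "harbor", "sunny": True}},
    {"img1": {"category": "forest", "sunny": True},
     "img2": {"category": "beach", "sunny": False}},
]

ranked = rank_answers(worlds, sunny_beach)
```

Here `img1` satisfies the query in three of the four worlds and `img2` in one, so `ranked` lists `img1` first with estimated probability 0.75 and `img2` second with 0.25.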

6. Experiments 

Now that we have learned a large KB from multimodal 
data sources, and have established a probabilistic language 
to express visual queries, we can demonstrate how a KB can 
be useful in a number of querying tasks. To demonstrate the 
utility of our KB, we perform several types of evaluations 
that involve vision tasks with multimodal answers including 
images, text and metadata. 

6.1. Answering Queries of Diverse Types 

We start with a qualitative demonstration of using the KB to answer a wide variety of queries by performing joint inference on image appearance, as well as metadata like geolocations, timestamps, and business information.³ Fig. 5 provides a few examples that depict the rich queries the system can handle. A user can ask the KB a question in natural language, such as “find me a modern looking mall near Fisherman’s Wharf.” While the photos of the malls are not part of the training data in Sec. 3.2, our system is capable of linking the photo contents to other metadata, and is able to offer the names and locations of the shopping malls. Similarly, in the second example, “find me a place in Boston where I can play baseball”, our system predicts the affordances from the appearances of the photos, and combines them with geolocation information to retrieve a list of places for playing baseball. In Fig. 5, the answers are shown in a list ranked by their marginal probabilities. Without a principled inference model, previous work such as NEIL [7] and LEVAN [10] cannot produce such probabilistic outputs.

² In this work, we manually annotate the conjunctive queries from natural language questions. The mapping from sentences to logical forms is a well-studied problem in NLP [2] and orthogonal to our system.

³ These metadata are either acquired from existing databases or automatically scraped online. Detailed descriptions of the experimental setups and the conjunctive queries (Sec. 5) for Fig. 5 are provided in Sec. B in the supplementary material.

[Figure 5 panels, one query each, with ranked answers shown alongside metadata such as ZIP codes, prices, phone numbers, temperatures, coordinates and dates: “Find me a modern looking mall near Fisherman's Wharf.”; “Find me a place in Boston where I can play baseball.”; “Find me a hotel in Boston with new furniture.”; “Find me a cozy bar to drink beer near AT&T Plaza.”; “Find me a sunny and warm beach during Christmas Day last year.”; “Find me pictures of sunny days of Seattle during August.”]

Figure 5: Proof-of-concept queries in a query answering application. We incorporate external data to enrich our knowledge base, and demonstrate its flexibility in answering real-world queries.
















Table 3: Performance of Scene Classification (in mAcc)

Method                   Basic level    Fine-grained
CNN Fine-tuned [54]      89.1           67.5
Attribute-based model    88.0           57.9
Attributes + Features    90.2           69.6
KB - Affordances         90.0           69.3
KB - Attributes          90.7           69.6
KB - Full                91.2           69.8

6.2. Single-Image Query Answering 

While our KB is designed for answering a wide range of queries, we can still evaluate how our system performs quantitatively on several standard visual recognition tasks without re-training. Based on the KB we have learned from data sources such as SUN (see Sec. 3.2), we show two experiments, on scene classification and affordance prediction. Both tasks can be thought of as answering queries for a single image, where these queries can be expressed by a single predicate with the querying labels taken as random variables, i.e., sceneCategory(img, c) and hasAffordance(img, a). Our system outperforms the state-of-the-art baseline methods on each of these tasks.

For both experiments, we use the data in Sec. 3.2 for training and an evaluation set of 29,781 images from the same 298 categories of SUN [50] for testing. We measure scene classification by mean accuracy (mAcc) over classes [57]. SUN [50] provides two levels of classification: basic-level (15 categories) and fine-grained (298 categories). Table 3 provides a summary of the results, comparing our full model (KB - Full) with a number of different settings and state-of-the-art models. We describe the models used in Table 3 as follows:

• CNN Fine-tuned We fine-tuned a CNN [54] on a subset of the SUN397 dataset [50] containing 107,754 images. We train ℓ2-logistic regression classifiers on the activations from the last fully-connected layer. We also use these activations as image features for all the other baselines.

• Attribute-based model We predict the scene attributes and affordances from the CNN features, and use a binary vector of the predicted values as an intermediate feature. This is the strategy adopted by Zhu et al. [59] to discretize visual data.

• Attributes + Features We concatenate the predicted labels from the Attribute-based model with the CNN features as a combined representation.

• KB - Affordances (Attributes) A smaller KB learned without affordances (attributes).

• KB - Full Our full KB model defined in Sec. 3.2.
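The mAcc metric reported in Table 3 can be sketched as below, assuming it denotes the unweighted average of per-class accuracies. The function name and the toy labels are illustrative and are not taken from the evaluation code.

```python
from collections import defaultdict


def mean_class_accuracy(y_true, y_pred):
    """Average the per-class accuracies so that every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == prediction)
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example with an imbalanced class distribution.
y_true = ["beach", "beach", "beach", "grotto"]
y_pred = ["beach", "beach", "swamp", "grotto"]
macc = mean_class_accuracy(y_true, y_pred)
```

In this toy case the per-class accuracies are 2/3 for beach and 1 for grotto, so the mean over classes is 5/6, whereas plain accuracy over the four images would be 3/4. The difference matters when categories have very different numbers of test images.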

The Attributes + Features model (the third row in Table 3) outperforms the Attribute-based model (the second row in Table 3) by 11.7% on the fine-grained task, indicating the importance of modeling continuous features in the KB. The full model KB - Full achieves state-of-the-art performance on both basic-level and fine-grained classes, with more than 2% improvement over the CNN baseline.

Table 4: Performance of Scene Affordance Prediction

Method                 mF1     mAP
CNN Fine-tuned [54]    81.6    74.2
KB - Full              82.6    75.7

Fig. 6 offers some insight as to why a KB-based model performs well in a scene classification task. The class label is one of the many labels jointly inferred and predicted by the KB system, along with attributes and affordances. So to predict an auditorium, attributes such as indoor lighting and enclosed area, and affordances such as taking class for personal interest, can all help to reinforce the prediction of an auditorium, and vice versa.

As mentioned in Sec. 3.2, we have collected annotations
of 227 affordance classes for each of the 298 scene categories.
We report the performance of affordance prediction
by mean average precision (mAP) and mean F1 score (mF1)
over the 227 affordance classes. The results are presented in
Table 4. Here we compare our full KB model with the CNN
Fine-tuned model [54], where we trained an $\ell_2$-logistic
regression classifier on the CNN features for each of the 227
affordance classes. The KB - Full model outperforms the
CNN baseline on both metrics.
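
For readers less familiar with the two metrics in Table 4, the sketch below computes them on toy data. It is an illustration only: the exact average-precision variant and the decision threshold behind the F1 score are not specified in the text, so the non-interpolated AP and the 0.5 threshold used here are assumptions, as are the scores and labels.

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision values at each positive's rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def f1_score(predicted, labels):
    """F1 for one class from Boolean predictions and ground-truth labels."""
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


# Hypothetical classifier scores and ground truth for two affordance classes.
per_class = {
    "teaching": ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]),
    "hunting": ([0.2, 0.7, 0.6, 0.4], [0, 1, 1, 0]),
}
aps = [average_precision(s, y) for s, y in per_class.values()]
f1s = [f1_score([v >= 0.5 for v in s], y) for s, y in per_class.values()]
mean_ap = sum(aps) / len(aps)  # mAP: average over classes
mean_f1 = sum(f1s) / len(f1s)  # mF1: average over classes
```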

Recall that the KB framework learns the weights of the
relations between entities (e.g., scene classes, attributes and
affordances) in a joint fashion. We can then examine
the strength of these relations by looking at the factor
weights of the underlying MRF. A large positive weight
between two entities indicates a strong co-occurrence relation,
whereas a large negative weight indicates a strong negative
correlation. Fig. 7 provides examples of the relations with
the largest positive and the largest negative weights between
scene classes and attributes (Fig. 7(a)), as well as between
scene classes and affordances (Fig. 7(b)). For example, the KB
has learned that the class beach has a strong co-occurrence
relation with the attribute sand, and that the class railroad
track is negatively correlated with the affordance teaching.

6.3. Image Search by Text Queries 

Using the same model and framework, we can also query
our KB for sets of images, instead of just one (Sec. 6.2),
such as "find me images of a sunny beach." Here we use the
same dataset as in Sec. 6.2. This task can also be expressed
by a single query in which the images are treated as variables
(see the example in Sec. 5).

We randomly generate 100 queries of a single label 
(scene category, affordance or attribute), and 100 queries 














[Figure 6 contents, one entry per image]

Class: auditorium
  Affordances: community and social work, taking class for
    personal interest, religious practices, waiting, attending
    the performing arts
  Attributes: congregating, indoor lighting, spectating,
    enclosed area, glossy

Class: landing deck
  Affordances: transportation and material moving work,
    in transit / traveling, military work
  Attributes: transporting things or people, asphalt,
    natural light, far-away horizon, man-made

Class: candy store
  Affordances: eating & drinking, food presentation, picking
    up / dropping off child, reading for personal interest,
    relaxing
  Attributes: no horizon, cluttered space, dirty, eating,
    waiting in line

Class: basilica
  Affordances: eating & drinking, attending or hosting parties,
    volunteer work, community and social work, religious
    practices
  Attributes: open area, natural light, sunny, man-made,
    vacationing

Class labels shown: swimming pool; aquatic theater
  Affordances: entertainment / arts / design / sports / media
    work, personal care and service work, socializing
  Attributes: still water, diving, no horizon, natural light,
    congregating

Class labels shown: bindery; fly bridge
  Affordances: boating, watching fishing, tobacco use,
    executive work, farming / fishing and forestry work
  Attributes: metal, sunny, wire, man-made, natural light

Figure 6: Sample prediction results by the full KB model. The ground-truth categories (in black) are shown in the first row. The first
four images show examples of correct predictions from our KB model, and the last two show incorrect examples. As our model jointly
infers multiple labels of an image, we show the predicted affordances (second row) in blue, and the predicted attributes (third row) in green.


(a) Top weighted relations between categories and attributes

Weight    Scene class       Attribute
 7.29     beach             sand
 5.68     creek             moist / damp
 5.65     house             shingles
-3.29     sun deck          flowers
-3.69     apse indoor       vinyl / linoleum
-3.86     gorge             man-made

(b) Top weighted relations between categories and affordances

Weight    Scene class       Affordance
13.8      mountain snowy    hunting
13.6      mountain          participating in equestrian sports
12.5      orchard           physical care of children
-0.94     call center       medical services
-0.95     machine shop      collecting as a hobby
-1.04     railroad track    teaching

Figure 7: Examples of the strongest and the weakest relations
in the learned KB. (a) Relations between scene classes (left
column) and scene attributes (right column). (b) Relations between
scene classes (left column) and scene affordances (right column).
In both (a) and (b), the number at the beginning of each row
indicates the actual factor weight in the underlying MRF. The more
positive the number, the stronger the correlation. We show
relations with the largest positive and negative weights in the KB. To
be consistent with Fig. 6, we use the same color scheme for
attributes and affordances.

of a pair of labels, each having at least 50 positive samples
in the test set. Given a set of query labels, we aim to retrieve
the test images that are annotated with all the semantic
labels in the set. We compare with two nearest neighbor
baseline methods [18]. NNall ranks the test images based
on the minimum Euclidean distance to any individual positive
sample in the training set. NNmean ranks the images
based on the distance to the centroid of the features of the
positive samples. We report the mean precision at k, the
mean fraction of correct retrievals out of the top k over all
queries, where k goes from 1 to 50. As shown in Fig. 8, our
method outperforms both simple nearest neighbor baselines
when k > 5. NNmean performs better than ours among the
top five retrievals; however, its false positive rate grows as
the number of retrievals increases. In contrast, the relations
in the KB compensate for the weak and noisy visual signals
and, as a result, maintain stable and good performance on
lower-ranked retrievals.

Figure 8: (a) Performance variations of top k retrievals. We
compare our method with two nearest neighbor baselines. In
contrast to these two methods, the KB model maintains a steady
performance on lower-ranked retrievals. (b) Top retrievals of
example queries. We show top four retrievals from three sample queries
(in bold) by our KB model. The green boxes indicate correct
retrievals, and red ones indicate incorrect retrievals.
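
The two nearest neighbor baselines and the precision-at-k metric described in this section can be sketched as follows. This is a minimal illustration on hypothetical 2-D features, not the evaluation code used in our experiments; all function and variable names are assumptions.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def rank_nn_all(test_feats, positives):
    """NNall: rank by the minimum distance to any positive training sample."""
    scores = {tid: min(euclidean(f, p) for p in positives)
              for tid, f in test_feats.items()}
    return sorted(scores, key=scores.get)


def rank_nn_mean(test_feats, positives):
    """NNmean: rank by the distance to the centroid of the positive samples."""
    dim = len(positives[0])
    centroid = [sum(p[d] for p in positives) / len(positives) for d in range(dim)]
    scores = {tid: euclidean(f, centroid) for tid, f in test_feats.items()}
    return sorted(scores, key=scores.get)


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrievals that carry all query labels."""
    return sum(1 for i in ranked_ids[:k] if i in relevant_ids) / k


# Hypothetical toy data: two positive training samples, three test images.
positives = [(0.0, 0.0), (4.0, 0.0)]
test_feats = {"t1": (2.0, 0.0), "t2": (0.0, 1.0), "t3": (10.0, 10.0)}
relevant = {"t1"}

ranking_all = rank_nn_all(test_feats, positives)
ranking_mean = rank_nn_mean(test_feats, positives)
p_at_1_all = precision_at_k(ranking_all, relevant, 1)
p_at_1_mean = precision_at_k(ranking_mean, relevant, 1)
```

The mean precision at k reported in Fig. 8 averages `precision_at_k` over all queries for each k.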

7. Conclusion 

This paper presents a principled framework to perform 
learning and inference on a large-scale multimodal knowl¬ 
edge base (KB). Our contribution is to build a scalable KB 
to answer a variety of visual queries without re-training. 
Our KB is capable of making predictions on a number of 
standard vision tasks, on par with state-of-the-art models 
trained specifically for those tasks. Beyond matching these
custom-trained classifiers, it is also interesting to explore
these knowledge representations as a step towards tackling
complex queries in real-world vision applications.
Furthermore, this platform can be used to explore image-based
reasoning. Towards these goals, future directions include a
tighter integration between language and vision, and a more 
robust model for incorporating richer information.

A. Scalable Knowledge Base Construction

There are three key steps to make the knowledge base
construction (KBC) scalable: data pre-processing, factor
graph generation and high-performance learning. Sec. 4.1
provides an overview of the KBC process illustrating these
three steps. Here we provide more detailed explanations of
our knowledge base construction pipeline.

A.1. Database Schema

The first step (the first box in Fig. 3) is to pre-process raw
data into a structured representation. This representation
enables us to perform structured queries (e.g., SQL) on the
data. We provide the complete database schema in Fig. 9.
The schema contains two types of tables: data tables contain
the entities in Sec. 3.2 that are used to build the knowledge
base (KB); metadata tables provide auxiliary information
for the experiments and visualization. sample_id in
Fig. 9 is a unique identifier of each training sample. These
identifiers are used as a distribution key in the database
system, where the data is distributed across segments according
to the distribution keys.

Each data table stores entities of a certain type. We have
a separate table for each of the four entity types in Sec. 3.2,
where continuous values (image features) are stored as double
precision numbers, and discrete values (scene category,
affordance and attribute labels) are stored as bigint. We
have seen in Sec. 4.1 that each row in the data tables corresponds
to a variable in the factor graph. Thus the entities in
Sec. 3.2 can be represented by different types of variables.
We use 4096 continuous variables to represent an Image entity
by its feature extracted from a fine-tuned CNN [54]. We
use a multinomial variable to represent a scene category label,
and Boolean variables to represent each of the attribute
labels and affordance labels.
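
To make the row-to-variable correspondence concrete, the sketch below creates two of the data tables and counts the variables they induce. It is an illustration under stated assumptions: an in-memory SQLite database stands in for the distributed database system, the underscored table and column names are our reading of Fig. 9, and the inserted rows are made up. Distribution keys have no SQLite counterpart and are therefore omitted.

```python
import sqlite3

# In-memory SQLite as a stand-in; the actual system uses a parallel database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image_features (
    id BIGINT PRIMARY KEY,      -- unique row id, one factor-graph variable per row
    sample_id BIGINT,           -- identifier of the training sample
    dimension BIGINT,           -- index of the feature dimension
    feat DOUBLE PRECISION       -- continuous feature value
);
CREATE TABLE scene_attributes (
    id BIGINT PRIMARY KEY,
    sample_id BIGINT,
    attribute_id BIGINT,
    label BIGINT                -- discrete (Boolean) attribute label
);
""")

# Hypothetical rows: two feature dimensions and one attribute label of sample 1.
conn.executemany(
    "INSERT INTO image_features VALUES (?, ?, ?, ?)",
    [(0, 1, 0, 0.25), (1, 1, 1, -1.5)],
)
conn.execute("INSERT INTO scene_attributes VALUES (?, ?, ?, ?)", (0, 1, 7, 1))

# Each row of a data table corresponds to one variable in the factor graph.
num_variables = conn.execute(
    "SELECT (SELECT COUNT(*) FROM image_features)"
    " + (SELECT COUNT(*) FROM scene_attributes)"
).fetchone()[0]
```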

A.2. Runtime environment

The knowledge base construction is conducted on a Non-Uniform
Memory Access (NUMA) machine [55] with four
NUMA nodes. Each has 12 physical cores and 24 logical
cores, with Intel Xeon CPU@2.40GHz and 1TB main
memory. We choose Greenplum as the underlying database
system due to its power in massive parallel data processing
(footnote 4).

4 http://www.pivotal.io/big-data/pivotal-greenplum-database

[Figure 9 contents: column name, data type; * marks the distribution key]

image_features
    id              bigint
    sample_id       bigint *
    dimension       bigint
    feat            double precision

scene_affordances
    id              bigint
    sample_id       bigint *
    affordance_id   bigint
    label           bigint

scene_categories
    id              bigint
    sample_id       bigint *
    category        bigint
    level           bigint

scene_attributes
    id              bigint
    sample_id       bigint *
    attribute_id    bigint
    label           bigint

scene_attribute_names
    attribute_id    bigint
    name            text

scene_affordance_names
    affordance_id   bigint
    name            text

scene_categories_names
    category        bigint
    name            text

train_test_split
    sample_id       bigint
    holdout         boolean

Figure 9: Database schema for structured representation. The
table names (in bold), column names (left) and data types (right)
are provided. The blue boxes denote data tables containing KB
entities; and the green ones denote metadata tables. The id column
is a unique identifier for each row, which is used to create the factor
graph. The stars (*) indicate the distribution keys for parallel data
processing.

A.3. Human-readable Rules

To define the KB with ease, we develop a declarative
language, which serves as a human-readable interface for
specifying the KB structure. The syntax of the declarative
language is an extension to first-order logic in order to
accommodate continuous variables. We introduced in Sec. 3.2
three types of relations. We define each type of relations by
a group of rules, where each rule R_j is a set specified with
first-order logic formulas.

We first explain an example rule. We then describe the
general form of the rules later. In Fig. 3 we have shown that
our KBC system creates a factor in the factor graph of image
I1 from the rule hasAffordance(I1, travel) ∧
hasAttribute(I1, sunny), which describes the
co-occurrence between the affordance label travel and the
attribute label sunny. We use the same example to show
how factors are generated from the declarative language.
Instead of writing rules for each of the affordance-attribute
pairs, we can simply write a rule:


{(i, w(x, y), 1) | hasAffordance(i, x) ∧ hasAttribute(i, y)}

where i, x and y correspond to the variables of images,
affordance labels and attribute labels respectively.
This rule can be instantiated by assigning values to these
variables. One possible assignment is to set i to image
I1, x to travel and y to sunny. This creates
a factor in the factor graph of image I1, where the factor
value is 1 when hasAffordance(I1, travel) ∧
hasAttribute(I1, sunny) holds and 0 otherwise. It
evaluates to 0 in the example of Fig. 3, as image I1 does
not have attribute sunny. Under such variable assignment,
the weight assigned to the factor is w(travel, sunny).
It indicates that this weight will be shared by all the
factors (one for each training image) that depict the
co-occurrence between the affordance travel and the
attribute sunny. This rule indicates that image I1
should have both hasAffordance(I1, travel) and
hasAttribute(I1, sunny) to be true with a confidence
score of w(travel, sunny). Similarly, the corresponding
factors for other images share the same weight
w(travel, sunny). More generally, each rule R_j corresponds
to a set in a given possible world I:

$$I(R_j) = \{(x, w(y), f(z))\} \qquad (6)$$

where x, y, z are sets of variables in the domain (the set of
all possible values the variables can take), and $w(\cdot)$ and
$f(\cdot)$ are real-valued functions. Here $f(\cdot)$ essentially defines
factors in the factor graph model and $w(\cdot)$ defines the
corresponding factor weights (see Sec. 3.1). The arguments
to $f(\cdot)$ define the variables required to compute the factor
value. The arguments to $w(\cdot)$ define how the factor weights
are shared across the factors.
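
The instantiation of the example co-occurrence rule can be sketched as below. This is a simplified illustration rather than our KBC system: the `Factor` tuple, the function name and the toy annotations are hypothetical, and the weight key simply records which arguments of w are shared.

```python
from collections import namedtuple

# (image, weight key, factor value), mirroring the triple (i, w(x, y), 1).
Factor = namedtuple("Factor", ["image", "weight_key", "value"])


def ground_cooccurrence_rule(affordances, attributes, aff_vocab, attr_vocab):
    """Instantiate {(i, w(x, y), 1) | hasAffordance(i, x) AND hasAttribute(i, y)}."""
    factors = []
    for image in sorted(affordances):
        for x in aff_vocab:
            for y in attr_vocab:
                holds = x in affordances[image] and y in attributes.get(image, set())
                # The factor value is 1 when the conjunction holds, 0 otherwise.
                factors.append(Factor(image, (x, y), 1 if holds else 0))
    return factors


# Toy annotations (hypothetical): I1 affords travel but is not sunny.
affordances = {"I1": {"travel"}, "I2": {"travel"}}
attributes = {"I1": set(), "I2": {"sunny"}}
factors = ground_cooccurrence_rule(affordances, attributes, ["travel"], ["sunny"])

# Both factors point to the same weight w(travel, sunny).
shared_weights = {f.weight_key for f in factors}
```

Because the weight key depends only on (x, y) and not on the image, one parameter is learned per affordance-attribute pair regardless of the number of training images.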

All three types of relations in Sec. 3.2 can be specified
as rules written in this declarative language. Fig. 10 provides
a complete list of rules that we have used to build the
visual KB. To be more specific, we express image-label relations
using two sets of rules corresponding to 1) the linear
terms, where the factors return the image feature values of
each dimension; and 2) the bias terms, where the factors
return a constant 1. For intra- and inter-correlations, we
express them as conjunctions of two predicates, where the
factors return 1 if both labels take the same Boolean value
(either true or false), and 0 otherwise. In total, the proposed
declarative language enables us to define the KB structure
with eighteen first-order logic rules. Our KBC system
automatically parses these rules, and creates a factor graph (see
the second box in Fig. 3). Now that we have the structure of the
factor graph model, the next step is to learn the model
parameters (i.e., factor weights). We discuss the details
of learning and inference in the next section.

A.4. Learning and Inference 

In this section we provide more technical details about 
learning and inference in our KB. 

A.4.1 Learning 

The factor graph model in Sec. 3.1 is an instance of standard
energy-based probabilistic models [2 ] where the energy
function E(I) is defined through a linear combination of
factors:

$$E(I) = \sum_{i=1}^{m} w_i f_i(I) \qquad (7)$$

A standard approach to learning is to minimize the negative
log-likelihood of the training data in Eq. (3). Due to the
intractability of computing the analytical gradients, sampling
is a common practice to estimate the log-likelihood
gradients. The gradient approximation used in Eq. (4) is
a special case of contrastive divergence [19], called CD-1.
Namely, instead of waiting for the Markov chain to converge,
we obtain a sample after only one step of Gibbs sampling.
This significantly reduces the cost of gradient computation
per step, and has been shown to be effective in several learning
tasks [4, 19]. We illustrate in Fig. 3(d) that we create a factor
graph for each image. This process is sometimes called
grounding in the literature [38]. During training we treat
these small factor graphs as a single large factor graph. The
variables are mixed and shuffled before sampling. A weight
update is performed at each Gibbs sampling step.
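
The CD-1 procedure can be illustrated on a toy model with two Boolean variables. The sketch below is a simplification under stated assumptions: it takes Pr[I; w] to be proportional to exp(sum_i w_i f_i(I)), updates the weights once per Gibbs sweep rather than per sampling step, and uses hypothetical factors, data, learning rate and function names.

```python
import math
import random


def features(world):
    """Factor values f_i(I) for a toy world of two Boolean variables (a, b)."""
    a, b = world
    return [a, b, 1 if (a and b) else 0]  # two bias factors, one co-occurrence


def score(world, w):
    return sum(wi * fi for wi, fi in zip(w, features(world)))


def gibbs_sweep(world, w, rng):
    """Resample every variable once from its conditional distribution."""
    world = list(world)
    for k in range(len(world)):
        on = world[:k] + [1] + world[k + 1:]
        off = world[:k] + [0] + world[k + 1:]
        p_on = 1.0 / (1.0 + math.exp(score(off, w) - score(on, w)))
        world[k] = 1 if rng.random() < p_on else 0
    return world


def cd1_train(data, n_epochs=200, lr=0.05, seed=0):
    rng = random.Random(seed)
    w = [0.0, 0.0, 0.0]
    for _ in range(n_epochs):
        for world in data:
            # One Gibbs sweep started at the training example (CD-1).
            sample = gibbs_sweep(world, w, rng)
            f_data, f_model = features(world), features(sample)
            # Approximate gradient: data statistics minus sample statistics.
            w = [wi + lr * (fd - fm) for wi, fd, fm in zip(w, f_data, f_model)]
    return w


# Hypothetical evidence in which the two labels always co-occur.
data = [(1, 1), (0, 0), (1, 1), (0, 0)]
weights = cd1_train(data)
```

On this data the co-occurrence weight is driven upwards, which is the behaviour one expects from labels that always appear together.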

A.4.2 Inference 

The inference task is to derive the marginal probabilities
of a conjunctive query in Eq. (5). This problem can be
regarded as computing the expectation of a real function
$f : X \to \mathbb{R}$ given the probability distribution of possible
worlds $I \in X$:

$$\mathbb{E}[f; w] = \sum_{I \in X} \Pr[I; w]\, f(I) \qquad (8)$$

where $\Pr[I; w]$ is the probability of a possible world I defined
in Eq. (2), and X is the set of all possible worlds.
Computing the exact expectation in Eq. (8) is intractable in
general factor graphs, as it requires summing over a large
(or even infinite) number of variable assignments. Gibbs
sampling is a commonly used method for approximate
inference.

Gibbs sampling starts with an initial world $I^{(0)}$.
For each random variable $v_k$ in the factor graph, we sample
its new value $v'_k$ from the conditional distribution
$\Pr[v_k \mid \mathrm{MB}(v_k); w]$, where $\mathrm{MB}(v)$ is the Markov blanket
of the variable v. In the context of factor graphs [24],
the Markov blanket of a variable is the set of factors
that are connected to the variable. The sampler then
moves to the next variable. After m rounds of iterations,
we have sampled a collection of possible worlds
$\Omega = \{I^{(0)}, I^{(1)}, \ldots, I^{(m)}\}$. We thus approximate the
expectation of a query q in Eq. (8) over $\Omega$:

$$\mathbb{E}[q; w] \approx \frac{1}{m} \sum_{i=1}^{m} q(I^{(i)}) \qquad (9)$$

where q(I) is the value of the conjunctive query q in possible
world I. To be specific, q(I) evaluates to 1 if all the
predicates in the query q are true in the possible world I,
and 0 otherwise. After sufficient iterations, the probability
of an answer to the query can be estimated by the number of
iterations in which it takes that value over the total number
of iterations.
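
The estimator in Eq. (9) can be checked on a graph small enough to enumerate. The sketch below is illustrative only: the two Boolean variables, the factor weights and the function names are hypothetical, and Pr[I; w] is assumed to be proportional to exp(sum_i w_i f_i(I)).

```python
import itertools
import math
import random

# Hypothetical factor weights for two Boolean variables (beach, sunny).
WEIGHTS = {"beach": 0.5, "sunny": -0.2, "beach&sunny": 1.5}


def score(world):
    beach, sunny = world
    return (WEIGHTS["beach"] * beach + WEIGHTS["sunny"] * sunny
            + WEIGHTS["beach&sunny"] * beach * sunny)


def query(world):
    """Conjunctive query q(I): 1 if every predicate holds in world I, else 0."""
    return 1 if all(world) else 0


def exact_marginal():
    """Eq. (8) by brute force, feasible only for tiny graphs."""
    worlds = list(itertools.product([0, 1], repeat=2))
    z = sum(math.exp(score(w)) for w in worlds)
    return sum(math.exp(score(w)) * query(w) for w in worlds) / z


def gibbs_marginal(m=20000, seed=0):
    """Eq. (9): fraction of sampled worlds in which the query holds."""
    rng = random.Random(seed)
    world = [0, 0]  # initial world I^(0)
    hits = 0
    for _ in range(m):
        for k in range(len(world)):
            on, off = list(world), list(world)
            on[k], off[k] = 1, 0
            p_on = 1.0 / (1.0 + math.exp(score(off) - score(on)))
            world[k] = 1 if rng.random() < p_on else 0
        hits += query(world)
    return hits / m


exact = exact_marginal()
estimate = gibbs_marginal()
```

With enough sweeps the sampled estimate approaches the enumerated marginal, while its cost does not depend on the number of possible worlds.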


Image-label relations

image features & scene category
{(i, w(d), f) | sceneCategory(i, c) ∧ hasFeature(i, d, f)}
{(i, w(c), 1) | sceneCategory(i, c)}

image features & scene affordance
{(i, w(a), f) | hasAffordance(i, a) ∧ hasFeature(i, d, f)}
{(i, w(a), 1) | hasAffordance(i, a)}

image features & scene attribute
{(i, w(d), f) | hasAttribute(i, a) ∧ hasFeature(i, d, f)}
{(i, w(a), 1) | hasAttribute(i, a)}

Intra-correlations

affordance & affordance
{((i, a1, a2), w(a1, a2), 1) | hasAffordance(i, a1) ∧ hasAffordance(i, a2)}
{((i, a1, a2), w(a1, a2), 1) | !hasAffordance(i, a1) ∧ !hasAffordance(i, a2)}

attribute & attribute
{((i, a1, a2), w(a1, a2), 1) | hasAttribute(i, a1) ∧ hasAttribute(i, a2)}
{((i, a1, a2), w(a1, a2), 1) | !hasAttribute(i, a1) ∧ !hasAttribute(i, a2)}

Inter-correlations

category & attribute
{((i, c, a), w(a, c), 1) | sceneCategory(i, c) ∧ hasAttribute(i, a)}
{((i, c, a), w(a, c), 1) | sceneCategory(i, c) ∧ !hasAttribute(i, a)}
{((i, c, a), w(a, c), 1) | !sceneCategory(i, c) ∧ hasAttribute(i, a)}
{((i, c, a), w(a, c), 1) | !sceneCategory(i, c) ∧ !hasAttribute(i, a)}

category & affordance
{((i, c, a), w(a, c), 1) | sceneCategory(i, c) ∧ hasAffordance(i, a)}
{((i, c, a), w(a, c), 1) | sceneCategory(i, c) ∧ !hasAffordance(i, a)}
{((i, c, a), w(a, c), 1) | !sceneCategory(i, c) ∧ hasAffordance(i, a)}
{((i, c, a), w(a, c), 1) | !sceneCategory(i, c) ∧ !hasAffordance(i, a)}

Figure 10: The complete list of rules for the visual knowledge
base construction. We build our visual knowledge
base with the rules above. ! denotes negation and ∧ denotes
conjunction. The formal semantics of the rules are
described in Sec. A.3.

B. Query Answering Application Setup 

In Fig. 5, we have provided six query examples that illustrate
the diversity of tasks our KB system can handle.
Answering these diverse types of queries requires a
fusion of information from various sources. In practice, we
aggregate information from online databases, business and
travel websites, etc. We provide the detailed experimental
setups and the data sources here.

We augment our KB in Sec. 3.2 with a new set of geo-tagged
images and several types of metadata. We briefly
introduced the extra data sources that we used for this
experiment in Sec. 6.1. We randomly sample from Flickr100M
(footnote 5) a pool of 20k images with geo-tags and timestamps.
Besides these images, we incorporate additional information by
either downloading from existing databases or crawling from
the web. All the information is stored in a structured format
as database tables (Sec. A.1).

1. We obtain a list of names and dates of 327 public
holidays from Freebase (footnote 6) [3] from the instances of
/time/holiday_category/holidays.

2. We scrape business information from Yelp.com and
Hotels.com. We have crawled in total over sixteen
thousand entries of business information, including 7k
bars, 6k shopping centers and 3k hotels.

3. We download the daily temperature and weather data
from National Climatic Data Center. Climate Data Online
(footnote 7) (CDO) provides free access to global historical
weather and climate data.

4. We download the publicly available GeoNames geographical
database (footnote 8), which maps geolocations to over
eight million place names.

We introduce new predicates (Boolean-valued functions) in
Fig. 11 that enable us to query with these additional data.
The semantics of these new predicates can be easily inferred
from the predicate names and input variables. For
instance, the predicate hasLocation(img, latlong1)
evaluates to true if the image img was annotated
with the geo-location latlong1 and false otherwise;
nearBy(latlong1, latlong2, 1km) evaluates to true if the
two geo-locations are within 1 km of each other and false otherwise.
Having defined the predicates, we use the augmented KB to
answer the queries in Fig. 5. We list the conjunctive queries
for each of the six example queries in Fig. 11. The predicates
in each query are connected by logical conjunctions.
Therefore the query evaluates to 1 if and only if every predicate
in the query is true, and 0 otherwise. answer(·)
indicates the return variables, i.e., the target answers to the
queries. We retrieve a ranked list of the answers by computing
a marginal probability of the queries (see Sec. 5 and
Sec. A.4). Note that, once these additional metadata are
incorporated into the KB framework, our system treats images,
existing metadata and these new metadata on an equal
footing in learning and inference. Therefore, a query can be
answered by a joint inference with no post-filtering steps.
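
To illustrate how the predicates of a conjunctive query combine, the sketch below evaluates the baseball query of Fig. 11 over a few deterministic facts. It is a deliberate simplification: in our system the label predicates are random variables and answers are ranked by marginal probability, whereas here every fact is treated as known. The haversine-based `near_by`, the coordinates and the image records are all hypothetical.

```python
import math


def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def near_by(p, q, radius_km):
    """Assumed semantics of nearBy: true if p and q are within radius_km."""
    return haversine_km(p, q) <= radius_km


# Hypothetical facts standing in for the KB tables.
images = {
    "img_a": {"latlong": (42.3651, -71.0589), "affordances": {"playing baseball"}},
    "img_b": {"latlong": (47.6062, -122.3321), "affordances": {"playing baseball"}},
    "img_c": {"latlong": (42.3605, -71.0590), "affordances": {"relaxing"}},
}
geo_names = {"Boston": (42.3601, -71.0589)}


def answer_baseball_in_boston():
    """hasAffordance AND hasLocation AND geoName AND nearBy => answer(img, latlong1)."""
    latlong2 = geo_names["Boston"]
    answers = []
    for img, facts in sorted(images.items()):
        latlong1 = facts["latlong"]
        if ("playing baseball" in facts["affordances"]
                and near_by(latlong1, latlong2, 1.0)):
            answers.append((img, latlong1))
    return answers


answers = answer_baseball_in_boston()
```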

Following this approach, we are able to express richer 
and more complex queries by joining different pieces of in¬ 
formation with logical conjunctions. As we can see, the 

5 http://yahoolabs.tumblr.com/post/89783581601/one-hundred-million-creative-commons-flickr-images

6 https://www.freebase.com

7 http://www.ncdc.noaa.gov/cdo-web/

8 http://www.geonames.org/



Q: Find me a modern looking mall near Fisherman's Wharf.

hasLocation(img, latlong1)
mall(mall, latlong2, zip)
geoName(Fisherman's Wharf, latlong3)
hasAttribute(img, indoor lighting)
hasAttribute(img, glossy)
nearBy(latlong1, latlong2, 1km)
nearBy(latlong1, latlong3, 20km)
⇒ answer(img, mall, zip)

Q: Find me a place in Boston where I can play baseball.

hasAffordance(img, playing baseball)
hasLocation(img, latlong1)
geoName(Boston, latlong2)
nearBy(latlong1, latlong2, 1km)
⇒ answer(img, latlong1)

Q: Find me a hotel in Boston with new furniture.

hasLocation(img, latlong1)
hasAttribute(img, glossy)
geoName(Boston, latlong2)
nearBy(latlong1, latlong2, 20km)
hotel(hotel, latlong2, date, price, phone)
⇒ answer(img, hotel, price, phone)

Q: Find me a cozy bar to drink beer near the AT&T Plaza.

hasAttribute(img, cluttered space)
hasLocation(img, latlong1)
bar(bar, latlong2, price, phone)
geoName(AT&T Plaza, latlong3)
nearBy(latlong1, latlong2, 1km)
nearBy(latlong1, latlong3, 1km)
⇒ answer(img, bar, price, phone)

Q: Find me a sunny and warm beach during Christmas Day 2013.

sceneCategory(img, beach)
hasAttribute(img, sunny)
hasAttribute(img, warm)
hasLocation(img, latlong1)
geoName(location, latlong2)
nearBy(latlong1, latlong2, 1km)
temperature(location, degree, 2013/12/25)
⇒ answer(img, location, degree, latlong2)

Q: Find me pictures of sunny days of Seattle during August.

hasAttribute(img, sunny)
hasLocation(img, latlong1)
hasDate(img, day, August, year)
geoName(Seattle, latlong2)
nearBy(latlong1, latlong2, 20km)
⇒ answer(img, day, August, year)

Figure 11: Conjunctive queries for the query answering
examples in Fig. 5. We omit the conjunction symbols (∧)
between predicates for neatness.



[Figure 12 contents, one entry per image]

train station (basic-level category: transportation)
  Affordances: transportation and material moving work,
    walking, travel, relaxing

pagoda (basic-level category: cultural or historical)
  Affordances: religious education, attending museums,
    attending religious services, socializing

bar (basic-level category: shopping and dining)
  Affordances: purchasing food, extracurricular club
    activities, socializing, sales work

herb garden (basic-level category: gardens and farms)
  Affordances: farming, lawn / garden & plant care, hobbies,
    grounds cleaning and maintenance work

carrousel (basic-level category: leisure spaces)
  Affordances: playing with children, volunteer at event,
    looking after children, relaxing

living room (basic-level category: home or hotel)
  Affordances: watching television & movies, telephone calls,
    interior home cleaning, listening to music

Figure 12: Sample affordance annotations in the augmented
scene dataset. We augment the SUN dataset [50] with a lexicon of
227 affordances. We provide the fine-grained category (in bold),
the basic-level category and a subset of their affordance
annotations.


query language in Sec. 5 is capable of expressing a wide
range of queries. Moreover, these queries can be answered
in a principled manner, by evaluating marginals in the joint
probability model. Given such a flexible framework, data
becomes the key to extending our model's ability to answer
real-world questions. In future work, we are interested in
exploring more efficient and automatic ways to aggregate
information from large-scale multimodal corpora.

C. Affordance Annotations 

We augment the SUN dataset [50] with additional an¬ 
notations of scene affordances. We use a lexicon of 227 
affordances (actions) from the American Time Use Sur¬ 
vey (ATUS) [40] sponsored by the Bureau of Labor Statis¬ 
tics, which catalogs the actions in daily lives and represents 
United States census data. The original ATUS lexicon in¬ 
cludes 428 specific activities organized into 17 major activ¬ 
ity categories and 105 mid-level categories. We re-organize 
the categories by collapsing visually similar superordinate 
categories into one action. For instance, the superordinate- 
level category “traveling” was collapsed into a single cate¬ 
gory because being in transit to go to school should be vi¬ 
sually indistinguishable from being in transit to go to the 
doctor. This results in 227 actions in total. Fig. 12 shows 
six example images with a subset of their affordance anno¬ 
tations. 

The lexicon covers a broad space of possible actions that
could take place in scenes. We conducted a large-scale online
experiment with over 400 AMT workers annotating the
possibility of each of the 227 actions for each of the 298 scene
categories (Sec. 3.2). Ten votes are collected for each
category-affordance pair. Positive (> 3 votes) and negative (< 2
votes) annotations are selected as evidence. These 227
affordances are listed in alphabetical order below:



A appliance repair & maintenance (self), architecture and
engineering work, arts & crafts, arts & crafts with children, arts /
design / entertainment / sports / media work, attending child’s events,
attending meetings for personal interest, attending movies,
attending museums, attending or hosting parties, attending religious
services, attending school-related meetings & conferences, attending
the performing arts

B banking, biking, boating, bowling, building & repairing
furniture, building and grounds cleaning and maintenance work,
business and financial operations work, buying / selling real estate

C camping, civic obligations, cleaning home exterior, collecting
as a hobby, community and social work, comparison shopping,
computer and mathematical work, computer use (not games),
construction and extraction work

D dancing, doing aerobics, doing gymnastics, doing martial arts

E eating & drinking, education and library work,
education-related administrative activities, email, exercising & playing with
animals, exterior home repair & decoration, extracurricular club
activities

F farming / fishing and forestry work, fencing, financial
management, fishing, food & drink preparation, food preparation and
serving work, food presentation

G gambling, golfing, grocery shopping

H health-related self care, healthcare work, helping adult,
helping child with homework, hiking, hobbies, home heating /
cooling, home security, home-schooling children, homework,
household organization & planning, hunting

I in transit / traveling, income-generating hobbies & crafts,
income-generating performance, income-generating rental
property activity, income-generating selling activities,
income-generating services, installation / maintenance and repair work,
interior decoration & repair, interior home cleaning

J job interviewing, job search activities

K kitchen & food clean-up

L laundry, lawn / garden & plant care, legal work, listening to
music (not radio), listening to radio, looking after adult, looking
after children

M mailing, maintaining home pool / pond / hot tub, management
/ executive work, military work

N non-veterinary pet care

O obtaining licenses & paying fees, obtaining medical care for
adult, obtaining medical care for child, office and administrative
work, organizing & planning for adults, organizing & planning for
children, out-of-home medical services

P participating in aquatic sports, participating in equestrian
sports, participating in rodeo, personal care and service work,
physical care of adults, physical care of children, picking up /
dropping off adult, picking up / dropping off child, playing
baseball, playing basketball, playing billiards, playing football,
playing games, playing hockey, playing racquet sports, playing rugby,
playing soccer, playing softball, playing sports with children,
playing volleyball, playing with children (not sports), production work,
protective services work, providing medical care to adult,
providing medical care to child, purchasing food (not groceries),
purchasing gasoline

R reading for personal interest, reading with children, relaxing,
religious education, religious practices, rock climbing / caving,
rollerblading / skateboarding, running

S sales work, school music activities, science work, security
screening, sewing & repairing textiles, sexual activity, shopping
(except food and gas), skiing / ice skating / snowboarding,
sleeping, socializing, storing household items, student government

T taking class for degree or certification, taking class for
personal interest, talking with children, telephone calls, tobacco use,
transportation and material moving work, travel

U using cardiovascular equipment, using clothing repair &
cleaning services, using home repair & construction services,
using in-home medical services, using interior home cleaning
services, using lawn & garden services, using
legal services, using meal preparation services, using other
financial services, using paid childcare services, using personal care
services, using pet services, using police & fire services, using
professional



